feat(cli): add the 'clerk webhooks' command group #323

Draft
rafa-thayto wants to merge 42 commits into main from rafa-thayto/webhooks

@rafa-thayto

Contributor

Summary

Adds the full clerk webhooks command group (13 commands) per the final spec: CRUD, delivery inspection, local forwarding via the Svix relay, replay, offline signature verification, and portal open.

  • CRUD & inspection: list, get, create, update, delete, secret [--rotate], event-types, messages
  • Delivery loop: listen (Svix relay WebSocket, persistent per-instance endpoint, HMAC verification, local forwarding with per-delivery diagnostics), trigger (validates event type first), replay (single message or bulk --since [--until] recovery)
  • Offline: verify — pure HMAC-SHA256 check, no auth gate, consumes listen NDJSON event lines via --delivery @file|-
  • Plumbing: 4 new ERROR_CODE entries, per-instance relay config, typed PLAPI functions for the 13 new routes (--iterator -> starting_after wire translation), group-level --app/--instance/--json with an auth preAction hook that exempts verify
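
To make the offline `verify` check concrete, here is a minimal sketch of Svix-style signature verification. It assumes the standard Svix scheme (HMAC-SHA256 over `id.timestamp.payload`, keyed by the base64 part of a `whsec_` secret, emitted as `v1,<base64>`). The function names are hypothetical and this is not the CLI's actual implementation; timestamp tolerance checks are left out.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper names; only the signing scheme itself is Svix's.
// The signed content is `${id}.${timestamp}.${payload}`, and the key is the
// base64-decoded secret with its `whsec_` prefix removed.
export function computeSignature(
  secret: string,
  id: string,
  timestamp: string,
  payload: string,
): string {
  const key = Buffer.from(secret.replace(/^whsec_/, ""), "base64");
  const digest = createHmac("sha256", key)
    .update(`${id}.${timestamp}.${payload}`)
    .digest("base64");
  return `v1,${digest}`;
}

// The signature header may carry several space-separated signatures;
// the delivery is accepted if any one of them matches.
export function verifySignature(
  secret: string,
  id: string,
  timestamp: string,
  payload: string,
  signatureHeader: string,
): boolean {
  const expected = Buffer.from(computeSignature(secret, id, timestamp, payload));
  return signatureHeader.split(" ").some((candidate) => {
    const actual = Buffer.from(candidate);
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return actual.length === expected.length && timingSafeEqual(actual, expected);
  });
}
```

Because the check needs only the secret, the three Svix headers, and the raw body, it can run with no network access, which is consistent with `verify` being exempt from the auth gate.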

Agent contract: bare domain JSON on stdout via log.data(), structured {"error":{code,…}} on stderr, exit codes 0/1/2/130, NDJSON for listen. Destructive commands (delete, secret --rotate, replay --since) prompt in human mode and require --yes in agent mode — validated before any network call.
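
As an illustration of the agent contract, the sketch below formats the two output shapes and maps outcomes to exit codes. Only the `{"error":{code,…}}` envelope and the code set 0/1/2/130 come from this description. The function names, the `message` field, and the exact meaning assigned to each exit code are assumptions (2 for usage errors matches the later `verify` fix; 130 is the conventional 128 + SIGINT).

```typescript
// Hypothetical shapes and names, for illustration only.
type AgentError = { error: { code: string; message: string } };
type Outcome = "ok" | "failure" | "usage" | "interrupted";

// stdout: bare domain JSON with no wrapper object.
export function formatData(value: unknown): string {
  return JSON.stringify(value);
}

// stderr: a structured error envelope that agents can branch on via `code`.
export function formatError(code: string, message: string): string {
  const body: AgentError = { error: { code, message } };
  return JSON.stringify(body);
}

// Assumed mapping of outcomes onto the documented exit codes.
export function exitCodeFor(outcome: Outcome): number {
  switch (outcome) {
    case "ok":
      return 0;
    case "failure":
      return 1;
    case "usage":
      return 2;
    case "interrupted":
      return 130;
  }
}
```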

Notable fixes that came out of the verification passes:

  • trigger validates the event type before endpoint resolution so agents always get unknown_event_type
  • listen gates delivery processing until the signing secret is fetched (no false verification warnings during startup)
  • the implicit piped-stdin --input-json expansion stands down when a literal - is in argv, unblocking verify --delivery - pipes
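
The `listen` startup gate can be pictured roughly as follows. This is a sketch under the assumption that early deliveries are queued and flushed in order once the secret arrives; the PR only states that processing is gated, and `DeliveryGate` is a hypothetical name.

```typescript
// Hypothetical gate: deliveries that arrive before the signing secret is
// known are held back, then processed in arrival order once `open` is called.
export class DeliveryGate<T> {
  private queue: T[] = [];
  private secret: string | undefined;
  private readonly handle: (delivery: T, secret: string) => void;

  constructor(handle: (delivery: T, secret: string) => void) {
    this.handle = handle;
  }

  push(delivery: T): void {
    if (this.secret === undefined) {
      this.queue.push(delivery); // secret not fetched yet: hold the delivery
      return;
    }
    this.handle(delivery, this.secret);
  }

  open(secret: string): void {
    this.secret = secret;
    const pending = this.queue;
    this.queue = [];
    for (const delivery of pending) this.handle(delivery, secret);
  }
}
```

With a gate like this, verification never runs without a secret, so startup cannot produce a spurious "signature mismatch" warning.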

The 13 PLAPI routes are being built in parallel in clerk_go; unit tests mock the PLAPI layer.

Test plan

  • bun run format / lint / typecheck clean
  • bun run test — 1846 tests pass (187 in the webhooks group)
  • Live agent-flow smoke matrix (CLERK_MODE=agent, isolated CLERK_CONFIG_DIR): verify success/mismatch/usage errors, stdin pipes, fail-fast --yes gates, structured API/auth error shapes
  • E2E against real PLAPI once the clerk_go routes land

@changeset-bot

changeset-bot Bot commented Jun 9, 2026


🦋 Changeset detected

Latest commit: dba0f5b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| clerk | Minor |


@rafa-thayto

Contributor Author

!snapshot

@github-actions

Contributor

Snapshot published

npm install -g clerk@2.0.1-snapshot.9f8329d
Package Version
clerk 2.0.1-snapshot.9f8329d

Published from 9f8329d

@rafa-thayto rafa-thayto force-pushed the rafa-thayto/webhooks branch 4 times, most recently from 71a9dc7 to 7a249e6 Compare June 18, 2026 12:14

The flaky E2E failures were Nuxt's beforeAll hitting the 300s budget. Two
distinct stalls shared one opaque "hook timed out" signature: one CI run hung
in `clerk link` (an untimed `fetch()` to the production Clerk API), another in
`git init`.

- Add a default 60s timeout to `loggedFetch`, composed with any caller signal
  via `AbortSignal.any` so tighter budgets (keyless's 15s) still win. A stalled
  connection now fails fast across every CLI command, not just in tests.
- Wrap each fixture setup step (git / clerk link / clerk init / npm ci) in a
  per-step timeout that fails with a labeled error instead of silently eating
  the whole 300s budget.
- Cap e2e `--parallel=4` to cut startup contention; add an explicit afterEach
  cleanup budget and `npm ci --no-audit --no-fund`.
- Drop noisy success-path debug traces; keep failure diagnostics.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
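
The timeout composition described in the first bullet above can be sketched like this. `withDefaultTimeout` and `fetchWithTimeout` are hypothetical names, not the actual `loggedFetch` code; `AbortSignal.timeout` and `AbortSignal.any` are the standard APIs.

```typescript
const DEFAULT_TIMEOUT_MS = 60_000;

// Hypothetical helper: combine a default timeout with an optional caller
// signal. The composed signal aborts as soon as either input aborts, so a
// tighter caller budget (for example 15s) still wins over the 60s default.
export function withDefaultTimeout(
  callerSignal?: AbortSignal,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): AbortSignal {
  const timeout = AbortSignal.timeout(timeoutMs);
  return callerSignal ? AbortSignal.any([callerSignal, timeout]) : timeout;
}

// Hypothetical wrapper showing where the composed signal would be applied.
export async function fetchWithTimeout(
  input: string | URL,
  init: RequestInit = {},
): Promise<Response> {
  return fetch(input, { ...init, signal: withDefaultTimeout(init.signal ?? undefined) });
}
```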
Address PR review feedback:

- `runStep` now spawns each setup step via `Bun.spawn` with an `AbortSignal`
  (Bun.$ can't be cancelled), so a timed-out git/clerk/npm step is killed
  instead of orphaned and left to race teardown. Adds runStep unit tests.
- fetch timeout test now fails if `loggedFetch` resolves instead of rejecting
  (no more false pass via swallowed error).
- Trim verbose comments.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The Bun.spawn rewrite (5ce158a) regressed the E2E job: 3 fixtures hung the
full 300s in beforeAll with no per-step timeout recovering, because reading a
killed child's piped stderr to EOF can block when a grandchild keeps the pipe
open. Restore the prior approach, which passed E2E in 52s:

- setup steps use Bun.$ again, wrapped in the Promise.race `withStepTimeout`
  (a timed-out step's subprocess is left to settle — beforeAll is never
  retried, so it can't cascade).
- drop the runStep Bun.spawn helper and its unit test.

The real root-cause fix (the 60s loggedFetch timeout that bounds a stalled
clerk link/init network call at the source) is unchanged.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
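
For reference, a `Promise.race` step timeout of the kind restored above might look like the sketch below. The name `withStepTimeout` comes from the commit message, but the signature and error text are assumptions. As the message notes, the losing promise is not cancelled: a timed-out step keeps running in the background.

```typescript
// Assumed signature; only the name and the Promise.race approach come from
// the commit message.
export async function withStepTimeout<T>(
  label: string,
  ms: number,
  step: () => Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins; the step itself is never killed.
    return await Promise.race([step(), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer once the race is decided
  }
}
```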
The Bun.spawn `runStep` rewrite (5ce158a) regressed CI. `clerk init` runs an
internal `npm install` with inherited stderr (init/heuristics.ts installSdk), so
when the per-step AbortSignal SIGKILLed the CLI, the npm grandchild survived
holding the stderr pipe open — `new Response(proc.stderr).text()` never EOF'd,
the timeout never threw, and the 300s beforeAll fired instead. 3 fixtures hung.

Root realization: `clerk init` and `npm ci` do package installs whose duration
scales with CI contention, so any fixed per-step budget false-fails under load
(clerk init blew past its 90s budget in the failing run). You can't fix
contention-driven flakiness by capping variable-duration install work tighter.

Fix: remove per-step timeouts entirely. The real root-cause fix — the 60s
loggedFetch timeout — still bounds the only thing that can truly hang (network
calls); `--parallel=4` cuts contention; the 300s beforeAll is the backstop.
Setup steps return to plain Bun.$ (as on main). Removes runStep and its test.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The remaining flake is npm, not the CLI. `npm ci`'s default `fetch-timeout`
is 300000ms — identical to the test's 300s beforeAll budget — so a single
stalled npm registry connection hangs setup until the hook times out. (clerk
init's installSdk skips here because the isolated env has no PATH, so npm ci is
the only unbounded npm install.)

- npm ci: add --fetch-timeout=60000 --fetch-retries=5 so a stalled fetch aborts
  at 60s and retries, mirroring the CLI's loggedFetch timeout.
- Restore the debug-gated git/link/init/npm step markers so any residual hang
  names the exact step instead of an opaque "hook timed out".

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The persistent 300s beforeAll hang was npm, not the CLI. npm's default
fetch-timeout is 300000ms, so one stalled registry connection during either
npm operation in setup blocks until the test budget expires. The previous
commit bounded `npm ci` but missed the other one: `clerk init` runs an internal
`npm install @clerk/<sdk>` (installSdk), which was still unbounded — that's what
hung the Vue fixture at 300007ms.

Write a project `.npmrc` (fetch-timeout=30s, fetch-retries=3) before any npm
runs. Both `clerk init`'s install and `npm ci` use projectDir as cwd, so it
covers both: a stalled fetch now aborts in 30s and retries on a fresh
connection instead of waiting 5 minutes. Worst case ~120s, safely under the
300s budget. Drops the redundant per-command npm flags.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
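
The `.npmrc` contents and the "~120s" figure can be reproduced with a small sketch. The helper names are hypothetical; `fetch-timeout` (in milliseconds) and `fetch-retries` are the npm config keys named above. The bound assumes the first attempt and every retry each wait the full timeout, and ignores npm's backoff between retries.

```typescript
// Hypothetical helper: render the project-level .npmrc described above.
export function renderNpmrc(fetchTimeoutMs: number, fetchRetries: number): string {
  return `fetch-timeout=${fetchTimeoutMs}\nfetch-retries=${fetchRetries}\n`;
}

// Rough upper bound on time lost to stalled fetches: one initial attempt
// plus `fetchRetries` retries, each capped at the timeout.
export function worstCaseStallMs(fetchTimeoutMs: number, fetchRetries: number): number {
  return fetchTimeoutMs * (1 + fetchRetries);
}
```

With a 30s timeout and 3 retries this gives 30s x 4 = 120s, which matches the worst case quoted in the commit message.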
Across four CI runs the 300s beforeAll hang moved randomly between fixtures
AND steps — including `git init`, a local, near-instant, near-silent command.
That rules out npm, the network, loggedFetch and the earlier Bun.spawn pipe
deadlock: the only thing that explains a trivial `git` subprocess hanging 300s
intermittently and only under `--parallel` is Bun.$ subprocess spawning/reaping
stalling under high concurrent load (each of 4 workers spawns git + 2 `bun`
CLIs + npm + a dev server + chromium at once).

Run fixtures serially (`--parallel=1`, still isolated) so at most one fixture's
subprocesses run at a time. Bump the E2E job timeout 30->45m for the slower
serial run. Keeps the .npmrc fetch-timeout and loggedFetch fixes.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

Serializing fixtures fixed the contention-driven setup hangs, but exposed a
second, independent flake: `clerk link` (and `init`) intermittently hang ~300s
in a non-fetch path the CLI's loggedFetch timeout can't bound — in human mode
they shell out to git and can stall on a git subprocess or prompt. It lands on a
different fixture each run, so it's transient, not deterministic.

Wrap both CLI steps in withRetry: a stall trips a hard timeout (90s/120s, above
loggedFetch's 60s so genuinely-slow API calls aren't pre-empted) and the retry
runs a fresh subprocess. Promise.race abandons the hung process (no stream
deadlock); beforeAll isn't retried so the orphan can't cascade.

Harden the setup against the intermittent Bun.$ subprocess stall (a spawned
git/clerk/npm step occasionally never resolves — verified a Promise.race
timeout still fires during the hang, so a retry recovers it).

- withRetry now wraps every step: git init, clerk link, clerk init, npm ci.
  A hung attempt is abandoned at its budget and a fresh subprocess retried.
- Tighten the project .npmrc (fetch-timeout 30s->20s, retries 3->2) so a real
  npm stall resolves well under the step budgets and can't false-trip them.
- Restore --parallel=4 (retry absorbs the higher hang frequency) and revert the
  E2E job timeout to 30m.

Keeps the loggedFetch 60s request timeout (bounds the CLI's own API calls).

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
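
A retry wrapper in the spirit of `withRetry` could be sketched as follows. The name comes from the commit messages above; the parameters and error text are assumptions. Each attempt races a hard timeout, and a hung attempt is abandoned (not killed) before a fresh one starts.

```typescript
// Assumed signature; only the name and the abandon-then-retry behaviour
// come from the commit messages.
export async function withRetry<T>(
  label: string,
  attempts: number,
  budgetMs: number,
  step: () => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`${label} attempt ${attempt} exceeded ${budgetMs}ms`)),
        budgetMs,
      );
    });
    try {
      // step() is invoked again on every attempt, i.e. a fresh subprocess.
      return await Promise.race([step(), timeout]);
    } catch (error) {
      lastError = error; // remember the failure and fall through to retry
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

Retrying only helps when the wrapped step is safe to run twice, which is the concern the later `clerk link --mode agent` change addresses.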
@rafa-thayto rafa-thayto force-pushed the rafa-thayto/webhooks branch from 99624a2 to 0bcbca4 Compare June 23, 2026 15:46

The retry on `clerk link` was making things worse: attempt 1 writes the profile
then the process intermittently hangs (a lingering handle after setProfile, not
a fetch — confirmed AbortSignal.timeout is unref'd), so withRetry kills it at
90s; attempt 2 then ran `clerk link --mode human` on the now-linked project,
hit the interactive "re-link?" confirm prompt, and failed with "Already linked"
(3/3 rerun sample failed this way).

Run link in `--mode agent`: on an already-linked project it prints status and
exits 0 instead of prompting, so the retry's second attempt succeeds. `clerk
init` is already idempotent on re-run ("Clerk is already set up" -> exit 0).

…; gate listen deliveries until setup completes
…tdin pipes

- delete / secret --rotate / replay --since now run the --yes/prompt gate
  before resolveAppContext, so agent mode gets the deterministic usage
  error without a network round-trip (and regardless of key validity)
- the implicit piped-stdin --input-json expansion now stands down when a
  literal '-' is in argv, fixing 'verify --delivery -' / '--payload -'
  which previously had their stdin consumed and rejected as nested JSON
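
The stdin stand-down rule in the second bullet reduces to a small predicate. This is a sketch with a hypothetical function name; the real expansion logic may consider more conditions.

```typescript
// Hypothetical predicate: should piped stdin be implicitly expanded into
// --input-json? It stands down when stdin is a TTY (nothing is piped) or
// when a literal "-" in argv already claims stdin for another flag.
export function shouldExpandPipedStdin(
  argv: readonly string[],
  stdinIsTTY: boolean,
): boolean {
  if (stdinIsTTY) return false;
  return !argv.includes("-");
}
```

For example, `verify --delivery -` contains a literal `-`, so the predicate returns false and stdin is left for `--delivery` to read.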
…he inbox URL

Live-relay verified: play.svix.com returns 400 'Invalid token' for
unprefixed tokens, and the relay only registers an inbox when the start
frame carries the same c_ token. With c_ in both, a POST to the inbox
round-trips through the WebSocket and the reply frame is accepted —
proven end-to-end against the real relay with no PLAPI involvement.
Reverses spec change #12 (recorded as spec change #27).
…li-program

- Import createOption from @commander-js/extra-typings (used by webhooks messages --status)
- Import parseIntegerOption from lib/option-parsers (used by webhooks list and messages --limit)
- Remove stray conflict-marker text fragments left by conflict resolution
…ndling

- splitCommaList now returns undefined for empty/whitespace-only values so
  callers treat them as "not provided" rather than sending an empty array
- list now prints the iterator hint when paginating
- add relay-client tests plus list/replay/update/verify coverage
- README: clarify keepalive probe timing and JSON-mode type discriminator

Claude-Session: https://claude.ai/code/session_01Mwcxk4pmfNYtmvjwWs9jUE
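
The `splitCommaList` behaviour described in the first bullet above can be sketched as below. The function name comes from the commit message; the body is an assumption about how such a helper might be written, including the choice to drop empty segments.

```typescript
// Assumed implementation. Empty or whitespace-only input yields undefined so
// that callers treat the option as "not provided" and do not send [].
export function splitCommaList(value: string | undefined): string[] | undefined {
  if (value === undefined) return undefined;
  const items = value
    .split(",")
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
  return items.length > 0 ? items : undefined;
}
```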
Adversarial audit follow-ups on the unreleased webhooks group:

- verify: reject an explicit empty --payload as a usage error instead of
  hashing an `undefined` pre-image and silently failing (exit 2, not 1).
- create: propagate an AuthError from the post-create secret fetch instead
  of masking it as "secret unavailable"; tag the genuine partial-failure
  with the new webhook_secret_fetch_failed code for agent branching.
- replay: `--until` alone now points at the missing --since rather than
  emitting the vaguer "pass <msg_id> or --since" hint.
- relay-client: route the 1008 token-collision redial through the standard
  reconnect backoff (no zero-delay storm) and guard onopen against a stop()
  that races socket construction.
- README: document at-least-once redelivery on reconnect (handlers must
  key on svix-id).

Adds tests for each fix. Full suite 1934 pass / 0 fail.

Claude-Session: https://claude.ai/code/session_015J6Sduw5KeHz6SxLEBfViF
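
Since the README now documents at-least-once redelivery on reconnect, a local handler needs to deduplicate on `svix-id`. The sketch below uses a hypothetical in-memory wrapper; a real handler would likely persist seen IDs so that deduplication survives restarts.

```typescript
// Hypothetical wrapper: run `handle` at most once per svix-id.
// Returns true when the payload was handled, false for a duplicate.
export function createIdempotentHandler<T>(
  handle: (payload: T) => void,
): (svixId: string, payload: T) => boolean {
  const seen = new Set<string>();
  return (svixId, payload) => {
    if (seen.has(svixId)) return false; // redelivered, e.g. after a reconnect
    seen.add(svixId);
    handle(payload);
    return true;
  };
}
```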
@rafa-thayto rafa-thayto force-pushed the rafa-thayto/webhooks branch from d96c2ca to a1352d3 Compare June 25, 2026 13:08
…ant pattern

Run the local relay tunnel with no Clerk backend via `listen --relay-only`
(skips PLAPI endpoint provisioning and the group auth gate, forces verification
off). Persist the relay token per instance so the relay URL is stable across
restarts, and add `--token <c_…>` to pin a deterministic, shareable URL.

Move the webhooks command tree into `registerWebhooks(program)` exported from
commands/webhooks/index.ts and wire it through cli-program's registrants array,
matching the project's command-registration pattern. Add a .claude rule that
documents the pattern when cli-program.ts or a command index.ts is edited.

Harden the command group's edge cases (svix_app_missing handling, friendly
API errors, --limit validation, header forwarding) and extract the SIGINT
handler into lib/signals.ts.

Claude-Session: https://claude.ai/code/session_01SYYJBsRxBQjCAuNbQiiLma